AITopics

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Education > Assessment & Standards (0.68)
Education > Educational Setting > K-12 Education > Secondary School (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Neural Information Processing Systems, Feb-18-2026, 05:20:49 GMT

cdf6f8e9fd9aeaf79b6024caec24f15b-Paper-Datasets_and_Benchmarks_Track.pdf

To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.

large language model, machine learning, natural language, (20 more...)

Country:

Europe > Italy (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(9 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Law (1.00)
Banking & Finance (0.92)
Government (0.92)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(5 more...)

Neural Information Processing Systems, Feb-8-2026, 00:24:08 GMT

117c5c8622b0d539f74f6d1fb082a2e9-Paper-Datasets_and_Benchmarks.pdf

dataset, evaluation, llm, (15 more...)

Country:

Asia > Thailand (0.05)
Africa > Kenya (0.04)
Asia > China > Beijing > Beijing (0.04)
(12 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (0.67)
Education > Assessment & Standards (0.67)
Education > Educational Setting > K-12 Education > Secondary School (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Byun, Grace, Rajwal, Swati, Choi, Jinho D.

LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

arXiv.org Artificial Intelligence, Nov-19-2025

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
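The two agreement measures mentioned in the abstract (correlation with human graders and exact score agreement) can be illustrated with a minimal sketch. It assumes Pearson correlation, which the abstract does not specify, and the scores below are hypothetical values invented for illustration, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave the same score."""
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical quiz scores (assumption: integer scale, one entry per student).
ta_scores = [5, 4, 3, 5, 2, 4]
llm_scores = [5, 4, 2, 5, 3, 4]

print(f"correlation: {pearson(ta_scores, llm_scores):.2f}")
print(f"exact agreement: {exact_agreement(ta_scores, llm_scores):.2f}")
```

The two numbers answer different questions: correlation can be high even when the LLM is consistently off by a point, while exact agreement counts only identical scores.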

large language model, machine learning, natural language, (19 more...)

2511.10819

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.69)
Education > Assessment & Standards > Student Performance (0.68)
Education > Educational Setting > Higher Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Eshuijs, Leon, Chaudhury, Archie, McBeth, Alan, Nguyen, Ethan

But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

arXiv.org Artificial Intelligence, Nov-7-2025

Detecting subtle forms of dishonesty like sycophancy and manipulation in Large Language Models (LLMs) remains challenging for both humans and automated evaluators, as these behaviors often appear through small biases rather than clear false statements. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a novel framework that employs steering vectors not to improve model behavior directly, but to enhance LLM judges' evaluation capabilities. JUSSA applies steering vectors during inference to generate more honest alternatives, providing judges with contrastive examples that make subtle dishonest patterns easier to detect. While existing evaluation methods rely on black-box evaluation, JUSSA leverages model internals to create targeted comparisons from single examples. We evaluate our method on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our results demonstrate that JUSSA effectively improves detection accuracy over single-response evaluation in various cases. Analysis across judge models reveals that JUSSA helps weaker judges on easier dishonesty detection tasks, and stronger judges on harder tasks. Layer-wise experiments show how dishonest prompts cause representations to diverge from honest ones in middle layers, revealing where steering interventions are most effective for generating contrastive examples. By demonstrating that steering vectors can enhance safety evaluation rather than just modify behavior, our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
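The idea of applying a steering vector during inference can be sketched in a few lines. This is an illustration under stated assumptions, not the JUSSA implementation: it assumes the vector is the mean difference between activations on honest and dishonest prompts (a common construction that the abstract does not confirm), and it uses plain Python lists in place of real model hidden states. The function names and the `alpha` scale are hypothetical.

```python
def steering_vector(honest_acts, dishonest_acts):
    """Mean activation difference (honest minus dishonest), per dimension.

    Assumption: each argument is a list of equal-length activation vectors
    taken from one layer of a model.
    """
    dim = len(honest_acts[0])
    mean_h = [sum(a[i] for a in honest_acts) / len(honest_acts) for i in range(dim)]
    mean_d = [sum(a[i] for a in dishonest_acts) / len(dishonest_acts) for i in range(dim)]
    return [h - d for h, d in zip(mean_h, mean_d)]


def apply_steering(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations, invented for illustration.
v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = apply_steering([0.0, 0.0], v, alpha=2.0)
print(v, steered)
```

In the framework described above, the response generated from the steered state would be shown to the judge alongside the original response as a contrastive example.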

large language model, machine learning, natural language, (17 more...)

2505.1776

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Banking & Finance (0.93)
Education (0.68)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

arXiv.org Artificial Intelligence, Oct-28-2025

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Hu, Tiancheng, Baumann, Joachim, Lupo, Lorenzo, Collier, Nigel, Hovy, Dirk, Röttger, Paul

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
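The distinction between low-entropy (consensus) and high-entropy (diverse) questions can be made concrete with Shannon entropy over the distribution of human answers. The sketch below assumes entropy in bits over answer-option probabilities; how SimBench itself computes or thresholds entropy is not stated in the abstract, and the two distributions are hypothetical.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a discrete answer distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical answer distributions for two four-option questions.
consensus_question = [0.90, 0.05, 0.03, 0.02]  # most respondents agree
diverse_question = [0.25, 0.25, 0.25, 0.25]    # responses evenly split

print(f"consensus: {entropy(consensus_question):.2f} bits")
print(f"diverse:   {entropy(diverse_question):.2f} bits")
```

A uniform split over four options gives the maximum of 2 bits, while a near-unanimous answer gives a value close to zero.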

large language model, machine learning, natural language, (17 more...)

2510.17516

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Government (1.00)
Law (0.93)
Transportation > Ground (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Neural Information Processing Systems, Oct-10-2025, 16:59:51 GMT

cdf6f8e9fd9aeaf79b6024caec24f15b-Paper-Datasets_and_Benchmarks_Track.pdf

To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.

claude 3, final answer, subplot, (13 more...)

Country:

Europe > Italy (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(9 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Law (1.00)
Banking & Finance (0.92)
Government (0.92)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(5 more...)

arXiv.org Artificial Intelligence, Oct-7-2025

Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Lim, Seungseop, Kim, Gibaeg, Han, Wooseok, Seo, Jean, Lee, Hyunkyung, Yoo, Jaehyo, Yang, Eunho

Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term Format Inertia, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
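Rebalancing a skewed turn-count distribution can be sketched as follows. This is one possible reading, not the authors' procedure: it assumes each dialogue is a list of turns and that balancing is done by oversampling the rarer turn counts up to the size of the largest group. The abstract does not say whether the paper oversamples, downsamples, or buckets turn counts, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def rebalance_by_turn_count(dialogues, seed=0):
    """Oversample so every turn count has as many dialogues as the largest group."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    target = max(len(items) for items in buckets.values())
    balanced = []
    for turn_count in sorted(buckets):
        items = buckets[turn_count]
        balanced.extend(items)
        # Draw extra copies (with replacement) for under-represented counts.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


# Hypothetical skewed dataset: six 2-turn dialogues, two 8-turn dialogues.
data = [["turn"] * 2 for _ in range(6)] + [["turn"] * 8 for _ in range(2)]
balanced = rebalance_by_turn_count(data)
print(len(balanced), sum(1 for d in balanced if len(d) == 8))
```

After rebalancing, long dialogues are seen as often as short ones during fine-tuning, which is the property the data-centric method above relies on.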

artificial intelligence, large language model, natural language, (14 more...)

2510.01688

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.47)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems, Aug-22-2025, 01:07:54 GMT

9ca22870ae0ba55ee50ce3e2d269e5de-Supplemental-Datasets_and_Benchmarks.pdf

artificial intelligence, benchmark, natural language, (18 more...)

Country:

North America > United States (0.14)
Europe > Middle East (0.04)
Asia > Middle East (0.04)
Africa > Middle East (0.04)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Social Media (0.68)

arXiv.org Artificial Intelligence, May-1-2025

TreeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering

Li, Zhonghao, Zhang, Kunpeng, Ou, Jinghuai, Liu, Shuliang, Hu, Xuming

Retrieval-augmented generation (RAG) systems face significant challenges in multi-hop question answering (MHQA), where complex queries require synthesizing information across multiple document chunks. Existing approaches typically rely on iterative LLM-based query rewriting and routing, resulting in high computational costs due to repeated LLM invocations and multi-stage processes. To address these limitations, we propose TreeHop, an embedding-level framework without the need for LLMs in query refinement. TreeHop dynamically updates query embeddings by fusing semantic information from prior queries and retrieved documents, enabling iterative retrieval through embedding-space operations alone. This method replaces the traditional "Retrieve-Rewrite-Vectorize-Retrieve" cycle with a streamlined "Retrieve-Embed-Retrieve" loop, significantly reducing computational overhead. Moreover, a rule-based stop criterion is introduced to further prune redundant retrievals, balancing efficiency and recall rate. Experimental results show that TreeHop rivals advanced RAG methods across three open-domain MHQA datasets, achieving comparable performance with only 5%-0.4% of the model parameter size and reducing the query latency by approximately 99% compared to concurrent approaches. This makes TreeHop a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications. For reproducibility purposes, codes and data are available here: https://github.com/allen-li1231/TreeHop-RAG.
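The shape of a "Retrieve-Embed-Retrieve" loop can be sketched with toy vectors. This is not TreeHop's model: the learned update that fuses query and document embeddings is replaced here by a simple weighted average, and the stop rule (halt when the best remaining similarity is not above `min_score`) is an assumption standing in for the paper's rule-based criterion. All function and parameter names are hypothetical.

```python
def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, corpus, seen):
    """Index of the most similar not-yet-retrieved document, or None."""
    candidates = [i for i in range(len(corpus)) if i not in seen]
    if not candidates:
        return None
    return max(candidates, key=lambda i: dot(query, corpus[i]))


def update_query(query, doc, weight=0.5):
    """Placeholder fusion: weighted average of query and document embeddings."""
    return [(1 - weight) * q + weight * d for q, d in zip(query, doc)]


def multi_hop(query, corpus, max_hops=2, min_score=0.0):
    """Retrieve, update the query embedding, and retrieve again, without an LLM."""
    seen = []
    for _ in range(max_hops):
        idx = retrieve(query, corpus, seen)
        if idx is None or dot(query, corpus[idx]) <= min_score:
            break  # assumed stop rule: nothing sufficiently similar remains
        seen.append(idx)
        query = update_query(query, corpus[idx])
    return seen


# Toy 3-dimensional document embeddings, invented for illustration.
corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
print(multi_hop([1.0, 0.0, 0.0], corpus, max_hops=3))
```

Each hop costs only a similarity search and a vector update, which is where the latency savings over LLM-based query rewriting would come from.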

large language model, machine learning, natural language, (20 more...)

2504.20114

Country:

North America > United States (0.46)
Asia > Middle East (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)